Automatically Creating Biomedical Bibliographic Records from Printed Volumes of Old Indexes
Author
Abstract
To provide online access to citations from old hardcopy indexes published from 1879 through 1965, an R&D division of the National Library of Medicine (NLM) is developing an automated system that converts bibliographic information in volumes of the printed Quarterly Cumulative Index Medicus (QCIM) to machine-readable form for inclusion in the OLDMEDLINE® database. The system processes images scanned from a QCIM volume, segments and labels the image records, identifies multiple occurrences of the same record in the volume, and creates unique citation records. Record segmentation and labeling rely on a bottom-up smearing approach to text block segmentation, the document page layout formats, and a set of record-labeling rules derived from the QCIM document format guideline. Because bibliographic information can appear as both “author entries” and “subject entries” in a QCIM document, duplicate records have to be detected and combined into a single unique citation. Duplicates are identified by matching “cross-reference” information, such as author names, journal title abbreviation, volume, pagination, month, and year, among different entries of the same citation. The same “cross-reference” information can also be used to correct OCR errors, which improves the quality of the citations created. The performance of the system has been evaluated using a QCIM volume published in 1929 that consists of 95,717 citation records. The evaluation shows the technical and cost feasibility of building the proposed data conversion system.
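The duplicate-detection step described in the abstract can be illustrated with a minimal sketch. It is not the NLM system's implementation: the `Entry` fields, the `normalize` helper, and the choice of key fields are all hypothetical, and the sketch uses exact matching after normalization, so it does not attempt the OCR error correction the abstract mentions.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One labeled QCIM entry. All field names are hypothetical."""
    kind: str      # "author" or "subject"
    authors: str
    journal: str   # journal title abbreviation
    volume: str
    pages: str
    month: str
    year: str


def normalize(text: str) -> str:
    """Lowercase and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def cross_reference_key(entry: Entry) -> tuple:
    """Build a matching key from the cross-reference fields.

    Assumption: author names are left out of the key because their
    formatting may differ between author and subject entries.
    """
    return (
        normalize(entry.journal),
        normalize(entry.volume),
        normalize(entry.pages),
        normalize(entry.month),
        normalize(entry.year),
    )


def group_duplicates(entries):
    """Group entries that share a cross-reference key.

    Returns a list of groups in first-seen order; each group would be
    merged into a single unique citation.
    """
    groups = defaultdict(list)
    for entry in entries:
        groups[cross_reference_key(entry)].append(entry)
    return list(groups.values())
```

Grouping by an exact key keeps the sketch linear in the number of entries. A production system facing OCR noise would likely need approximate matching on the same fields, which is outside the scope of this illustration.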
Related resources
Functional Requirements for Bibliographic Records: A New Approach to Arranging Bibliographic Elements
Functional Requirements for Bibliographic Records (FRBR) is a conceptual model for the arrangement of bibliographic records in catalogs and databases. It was proposed by IFLA in 1997, following a plan for revising the Anglo-American Cataloging Rules (AACR). The model tends to stand apart from other cataloging rules and uses a new structure for storing and displaying bibliographic record...
Historical Author Affiliations Assist Verification of Automatically Generated MEDLINE® Citations
High OCR error rates encountered in author affiliations increase the manual labor needed to verify MEDLINE citations automatically created from scanned journal articles. This is due to poor OCR recognition of the small text and italics frequently used in printed affiliations. Using author-affiliation relationships found in existing MEDLINE records, the SeekAffiliation (SA) program automatically...
Text Verification in an Automated System for the Extraction of Bibliographic Data
An essential stage in any text extraction system is the manual verification of the printed material converted by OCR. This proves to be the most labor-intensive step in the process. In a system built and deployed at the National Library of Medicine to automatically extract bibliographic data from scanned biomedical journals, alternative means were considered to validate the text. This paper des...
A Feasibility Study of Resource Description and Access (RDA) Implementation in Manuscripts’ Bibliographic Records in Iran
This study was conducted to investigate the feasibility of Resource Description and Access (RDA) implementation in manuscripts’ bibliographic records. Paper type: This is applied research. The research follows a research-and-development design based on documentary and comparative approaches. Findings: The findings prove that out of the identified el...
Retrospective Conversion of Old Bibliographic Catalogues
This paper describes a framework for retrospective document conversion in the library domain. Drawing on the experience and insight gained from the MORE project launched over the present decade by the European Commission, it outlines the requirements for solving the problem of retroconversion of old catalogues in UNIMARC format. Based on OCR technique and automatic structure recognition, the sy...